
Conversation

@dmitripikus
Contributor

When using the completions API, for some prompt contents, some models return a response with empty text and a single generated token, i.e. they do not generate a response as expected.
Example of the problem:
Request:

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "prompt": "USER: Hi, here is some system prompt: hi .Here are some other context:  hi.Here is question #1: can?ASSISTANT: Hi",
    "max_tokens": 20,
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "temperature": 1
  }'

Model: meta-llama/Llama-3.1-70B-Instruct

To make the model continue text generation, a "\nASSISTANT:" suffix needs to be appended to the prompt, as sketched below.
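
A minimal sketch of the workaround in Python, assuming a plain requests call against the same local endpoint; the ensure_assistant_suffix helper is illustrative only and not part of this change (streaming is left off here for brevity):

import requests

def ensure_assistant_suffix(prompt: str) -> str:
    # Append "\nASSISTANT:" when the prompt does not already end with it,
    # so the model continues the assistant turn instead of stopping early.
    suffix = "\nASSISTANT:"
    return prompt if prompt.endswith(suffix) else prompt + suffix

payload = {
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "prompt": ensure_assistant_suffix(
        "USER: Hi, here is some system prompt: hi .Here are some other context:  hi.Here is question #1: can?ASSISTANT: Hi"
    ),
    "max_tokens": 20,
    "temperature": 1,
}

resp = requests.post("http://localhost:8080/v1/completions", json=payload)
print(resp.json()["choices"][0]["text"])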

@Siddhant-Ray
Collaborator

@Shaoting-Feng PTAL too

Collaborator

@Shaoting-Feng left a comment


LGTM

@Shaoting-Feng merged commit e140662 into LMCache:main on Aug 8, 2025
1 check failed

3 participants